
Extracting Patents that cite Publications from a chosen Research Organization

This tutorial shows how to extract and analyse patent information linked to a selected research organization, using the Dimensions Analytics API.

Load libraries and log in

[2]:
# @markdown # Get the API library and login
# @markdown **Privacy tip**: leave the password blank and you'll be asked for it later. This can be handy on shared computers.
username = ""  #@param {type: "string"}
password = ""  #@param {type: "string"}
endpoint = "https://app.dimensions.ai"  #@param {type: "string"}

# import all libraries and login
!pip install dimcli plotly_express -U --quiet
import dimcli
from dimcli.shortcuts import *
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()
#
import os
import sys
import time
import json
import pandas as pd
from pandas.io.json import json_normalize
from tqdm import tqdm_notebook as progressbar
#
# charts lib
import plotly_express as px
if not 'google.colab' in sys.modules:
  # make js dependencies local / needed by html exports
  from plotly.offline import init_notebook_mode
  init_notebook_mode(connected=True)
DimCli v0.6.1.2 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)

A couple of utility functions to simplify exporting CSV files to a selected folder

[3]:
#
# data-saving utils
#
DATAFOLDER = "extraction1"
#
if not os.path.exists(DATAFOLDER):
  !mkdir $DATAFOLDER
  print(f"==\nCreated data folder:", DATAFOLDER + "/")
#
def save_as_csv(df, save_name_without_extension):
    "usage: `save_as_csv(dataframe, 'filename')`"
    df.to_csv(f"{DATAFOLDER}/{save_name_without_extension}.csv", index=False)
    print("===\nSaved: ", f"{DATAFOLDER}/{save_name_without_extension}.csv")

Choose a GRID Research Organization

For the purpose of this exercise, we are going to use grid.89170.37. Feel free to change the parameters below as you like, e.g. by choosing another GRID organization.

[2]:

GRIDID = "grid.89170.37" #@param {type:"string"}

#@markdown The start/end year of publications used to extract patents
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}

if YEAR_END < YEAR_START:
  YEAR_END = YEAR_START


#
# gen link to Dimensions
#

def dimensions_url(grids):
    root = "https://app.dimensions.ai/discover/publication?or_facet_research_org="
    return root + "&or_facet_research_org=".join([x for x in grids])

from IPython.core.display import display, HTML
display(HTML('---<br /><a href="{}">Open in Dimensions &#x29c9;</a>'.format(dimensions_url([GRIDID]))))


#@markdown ---
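As a quick sanity check of the URL helper (reproducing `dimensions_url` here so the snippet is self-contained; grid.4991.5 is just a second example GRID ID):

```python
def dimensions_url(grids):
    # build a Dimensions search URL with one research-org facet per GRID ID
    root = "https://app.dimensions.ai/discover/publication?or_facet_research_org="
    return root + "&or_facet_research_org=".join(grids)

url = dimensions_url(["grid.89170.37", "grid.4991.5"])
print(url)
```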

1 - Prerequisite: Extracting Publications Data

By looking at the Dimensions API data model, we can see that the connection between Patents and Publications is represented via a directed arrow going from Patents to Publications: that means that we should look for patent records whose publication_ids field contains references to the publications (from our chosen GRID organization) we are interested in.

Hence, we need to:

  • a) extract all publications linked to one (or more) GRID IDs, and

  • b) use these publications to extract patents referencing those publications.
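The two queries can be sketched as plain DSL strings before running anything (no API call here; the sample publication IDs are purely illustrative placeholders for the output of step a):

```python
import json

GRIDID = "grid.89170.37"

# step a) publications affiliated with the organization
q_pubs = f'search publications where research_orgs.id="{GRIDID}" return publications[id]'

# step b) patents whose publication_ids field references those publications
sample_pub_ids = ["pub.1008622444", "pub.1061127158"]  # ids as returned by step a)
q_patents = (f"search patents where publication_ids in {json.dumps(sample_pub_ids)} "
             "return patents[basics]")

print(q_patents)
```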

[3]:
# Get full list of publications linked to this organization for the selected time frame

q = f"""search publications
        where research_orgs.id="{GRIDID}"
        and year in [{YEAR_START}:{YEAR_END}]
        return publications[basics+category_for+times_cited]"""

pubs_json = dsl.query_iterative(q)
pubs = pubs_json.as_dataframe()

# save the data
save_as_csv(pubs, f"pubs_{GRIDID}")
1000 / 17126
2000 / 17126
3000 / 17126
4000 / 17126
5000 / 17126
6000 / 17126
7000 / 17126
8000 / 17126
9000 / 17126
10000 / 17126
11000 / 17126
12000 / 17126
13000 / 17126
14000 / 17126
15000 / 17126
16000 / 17126
17000 / 17126
17126 / 17126
===
Saved:  extraction1/pubs_grid.89170.37.csv

How many publications per year?

Let’s have a quick look at the publication volume per year.

[4]:
px.histogram(pubs, x="year", y="id", color="type", barmode="group", title=f"Publications by year from {GRIDID}")

What are the main subject areas?

We can use the Field of Research categories information in publications to obtain a breakdown of the publications by subject areas.

This can be achieved by ‘exploding’ the category_for data into a separate table, since there can be more than one category per publication. The new categories table also retains some basic information about the publications it relates to (e.g. journal, title, publication ID), so as to make the data easier to analyse.
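Conceptually, ‘exploding’ turns each (publication, category) pair into its own row, repeating the publication-level metadata on every row. A minimal pure-Python sketch of the idea, using toy records shaped like the API output:

```python
pubs_sample = [
    {"id": "pub.1", "year": 2003,
     "category_for": [{"id": "2921", "name": "0912 Materials Engineering"},
                      {"id": "2202", "name": "02 Physical Sciences"}]},
    {"id": "pub.2", "year": 2005, "category_for": []},  # publication with no categories
]

# one output row per (publication, category) pair, carrying the publication metadata
rows = [
    {"for_id": c["id"], "for_name": c["name"], "id": p["id"], "year": p["year"]}
    for p in pubs_sample
    for c in p.get("category_for", [])
]

for r in rows:
    print(r)
# pub.1 yields two rows; pub.2 yields none
```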

[5]:
# ensure key exists in all rows (even if empty)
normalize_key("category_for", pubs_json.publications)
normalize_key("journal", pubs_json.publications)
# explode subjects into separate table
pubs_subjects = json_normalize(pubs_json.publications, record_path=['category_for'],
                               meta=["id", "type", ["journal", "title"], "year"],
                               errors='ignore', record_prefix='for_')
# add a new column: category name without digits for better readability
pubs_subjects['topic'] = pubs_subjects['for_name'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
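The digit-stripping step works like this on a typical FOR category name (the `.strip()` shown here is an optional extra, to also drop the leftover leading space):

```python
# remove the numeric FOR code prefix from a category name
strip_digits = lambda x: ''.join(i for i in x if not i.isdigit())

raw = "0912 Materials Engineering"
print(strip_digits(raw))          # leading space remains
print(strip_digits(raw).strip())  # fully cleaned
```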

Now we can build a scatter plot that shows the amount and distribution of categories over the years.

[6]:
px.scatter(pubs_subjects, x="year", y="topic", color="type",
           hover_name="for_name",
           height=800,
           marginal_x="histogram", marginal_y="histogram",
           title=f"Top publication subjects for {GRIDID} (marginal subplots = X/Y totals)")

2 - Extracting Patents linked to Publications

In this section we extract all patents linked to the publications dataset previously created. The steps are the following:

  • we loop over the publication IDs and create patent queries, via the publication_ids field of patents

  • we collate all patents data, remove duplicates and save the results

  • finally, we count patents per publication and enrich the original publication dataset with these numbers
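The batching in the first step uses `chunks_of` from dimcli; a minimal stand-in, assuming it simply yields successive fixed-size slices of a sequence:

```python
def chunks_of(seq, n):
    """Yield successive n-sized slices of seq (the last one may be shorter)."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

pub_ids = [f"pub.{i}" for i in range(10)]
batches = list(chunks_of(pub_ids, 4))
print([len(b) for b in batches])  # -> [4, 4, 2]
```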

[7]:
#
# the main query
#
q = """search patents where publication_ids in {}
  return patents[basics+publication_ids+FOR]"""


#
# useful libraries for looping
#
from dimcli.shortcuts import chunks_of
from tqdm import tqdm_notebook as progressbar

#
# let's loop through all publication IDs in chunks and query Dimensions
#
print("===\nExtracting patents data ...")
patents_json = []
BATCHSIZE = 400
VERBOSE = False # set to True to see patents extraction logs
pubsids = pubs['id']

for chunk in progressbar(list(chunks_of(list(pubsids), BATCHSIZE))):
    data = dsl.query_iterative(q.format(json.dumps(chunk)), verbose=VERBOSE)
    patents_json += data.patents
    time.sleep(1)


#
# put the patents data into a dataframe, remove duplicates and save
#
patents = pd.DataFrame().from_dict(patents_json)
print("Patents found: ", len(patents))
patents.drop_duplicates(subset='id', inplace=True)
print("Unique Patents found: ", len(patents))
# save
save_as_csv(patents, f"patents_{GRIDID}")
# turning lists into strings to ensure compatibility with CSV loaded data
# see also: https://stackoverflow.com/questions/23111990/pandas-dataframe-stored-list-as-string-how-to-convert-back-to-list
patents['publication_ids'] = patents['publication_ids'].apply(lambda x: ','.join(map(str, x)))


#
# count patents per publication and enrich the original dataset
#
def patents_per_pub(pubid):
  global patents
  return patents[patents['publication_ids'].str.contains(pubid)]

print("===\nCounting patents per publication...")

l = []
for x in progressbar(pubsids):
  l.append(len(patents_per_pub(x)))

#
# enrich and save the publications data
#
pubs['patents'] = l
save_as_csv(pubs, f"pubs_{GRIDID}_enriched_patents")

===
Extracting patents data ...

Patents found:  4677
Unique Patents found:  3966
===
Saved:  extraction1/patents_grid.89170.37.csv
===
Counting patents per publication...

===
Saved:  extraction1/pubs_grid.89170.37_enriched_patents.csv
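One caveat about `patents_per_pub` above: `str.contains` does substring matching on the comma-joined string, so an ID that happens to be a prefix of another (hypothetical example below) would be over-counted. Splitting the string back into exact IDs avoids this:

```python
# comma-joined publication_ids strings, as produced above (toy data)
rows = [
    "pub.1038200468,pub.1067588797",
    "pub.1038,pub.9999",
]

target = "pub.1038"

substring_hits = sum(target in r for r in rows)          # prefix match counts both rows
exact_hits = sum(target in r.split(",") for r in rows)   # exact match counts only one

print(substring_hits, exact_hits)  # -> 2 1
```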

A quick look at the data

[8]:
# display top 3 rows
patents.head(3)
[8]:
FOR publication_ids assignees publication_date year granted_year title times_cited id assignee_names filing_status inventor_names
0 None pub.1038200468,pub.1067588797,pub.1006473292,p... [{'id': 'grid.8767.e', 'city_name': 'Brussels'... 2017-09-14 2017 NaN CD20 BINDING AGENTS AND USES THEREOF 0 WO-2017153345-A1 [VIB VZW, UNIV GENT, UNIV BRUSSEL VRIJE] Application [TAVERNIER, JAN, VAN DER HEYDEN, José, DEVOOGD...
1 [{'id': '2921', 'name': '0912 Materials Engine... pub.1021290825,pub.1034244510,pub.1022541167 [{'id': 'grid.454156.7', 'city_name': 'Hsinchu... 2016-08-18 2015 NaN Semiconductor Devices Comprising 2D-Materials ... 11 US-20160240719-A1 [Taiwan Semiconductor Manufacturing Co (TSMC) ... Application [Meng-Yu Lin, Shih-Yen Lin, Si-Chen Lee, Samue...
2 [{'id': '2921', 'name': '0912 Materials Engine... pub.1073831920,pub.1034244510,pub.1021290825,p... [{'id': 'grid.454156.7', 'city_name': 'Hsinchu... 2018-01-02 2015 2018.0 Semiconductor devices comprising 2D-materials ... 0 US-9859115-B2 [Taiwan Semiconductor Manufacturing Co (TSMC) ... Grant [Meng-Yu Lin, Shih-Yen Lin, Si-Chen Lee, Samue...

Publications now have patents info:

[9]:
pubs.sort_values("patents", ascending=False).head(3)
[9]:
category_for times_cited pages year id issue volume author_affiliations type title journal.id journal.title journal patents
13542 [{'id': '2921', 'name': '0912 Materials Engine... 55 113-121 2003 pub.1008622444 2-3 4 [[{'first_name': 'Leonidas C', 'last_name': 'P... article High efficiency molecular organic light-emitti... jour.1047200 Organic Electronics NaN 237
16286 [{'id': '2921', 'name': '0912 Materials Engine... 5 341-348 2001 pub.1061127158 2 49 [[{'first_name': 'M.', 'last_name': 'Friedman'... article Low-Loss RF Transport Over Long Distances jour.1123356 IEEE Transactions on Microwave Theory and Tech... NaN 150
11610 [{'id': '2202', 'name': '02 Physical Sciences'... 4177 435-446 2005 pub.1012397724 6 4 [[{'first_name': 'Igor L.', 'last_name': 'Medi... article Quantum dot bioconjugates for imaging, labelli... jour.1031408 Nature Materials NaN 90

3 - Patents Data Analysis

Now that we have extracted all the data we need, let’s start exploring them by building a few visualizations.

How many patents per year?

[10]:
px.histogram(patents, x="year", y="id", color="filing_status",
             barmode="group",
             title=f"Patents referencing publications from {GRIDID} - by year")

Who is filing the patents?

This can be done by looking at the assignees field of patents. Since the field contains nested information, we first need to extract it into its own table (similarly to what we did above with publication categories).

[11]:
# ensure the key exists in all rows (even if empty)
normalize_key('assignees', patents_json)
# explode assignees into separate table
patents_assignees = json_normalize(patents_json,
                                   record_path=['assignees'],
                                   meta=['id', 'year', 'title'],
                                   meta_prefix="patent_")
[12]:
top_assignees = patents_assignees.groupby(['name', 'country_name'],  as_index=False).count().sort_values(by="patent_id", ascending=False)
px.bar(top_assignees, x="name", y="patent_id",
       hover_name="name", color="country_name",
       title=f"Top Assignees for patents referencing publications from {GRIDID}")
[13]:
px.scatter(patents_assignees,  x="name", y="country_name",
           color="patent_year", hover_name="name",
           hover_data=["id", "patent_id"],  marginal_y="histogram",
           title=f"Assignees for patents referencing publications from {GRIDID} - Yearly breakdown")

What are the publications most frequently referenced in patents?

[14]:
pubs_cited = pubs.query("patents > 0 ").sort_values('patents', ascending=False).copy()
pubs_cited
[14]:
category_for times_cited pages year id issue volume author_affiliations type title journal.id journal.title journal patents
13542 [{'id': '2921', 'name': '0912 Materials Engine... 55 113-121 2003 pub.1008622444 2-3 4 [[{'first_name': 'Leonidas C', 'last_name': 'P... article High efficiency molecular organic light-emitti... jour.1047200 Organic Electronics NaN 237
16286 [{'id': '2921', 'name': '0912 Materials Engine... 5 341-348 2001 pub.1061127158 2 49 [[{'first_name': 'M.', 'last_name': 'Friedman'... article Low-Loss RF Transport Over Long Distances jour.1123356 IEEE Transactions on Microwave Theory and Tech... NaN 150
11610 [{'id': '2202', 'name': '02 Physical Sciences'... 4177 435-446 2005 pub.1012397724 6 4 [[{'first_name': 'Igor L.', 'last_name': 'Medi... article Quantum dot bioconjugates for imaging, labelli... jour.1031408 Nature Materials NaN 90
14007 [{'id': '2581', 'name': '0601 Biochemistry and... 1461 47-51 2003 pub.1036950132 1 21 [[{'first_name': 'Jyoti K.', 'last_name': 'Jai... article Long-term multiple color imaging of live cells... jour.1115214 Nature Biotechnology NaN 69
11331 [{'id': '2921', 'name': '0912 Materials Engine... 14 2207-2214 2005 pub.1061591798 10 52 [[{'first_name': 'Haizhou', 'last_name': 'Yin'... article Ultrathin Strained-SOI by Stress Balance on Co... jour.1019990 IEEE Transactions on Electron Devices NaN 67
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10248 [{'id': '2401', 'name': '0204 Condensed Matter... 34 2817-2823 2006 pub.1006364633 11 21 [[{'first_name': 'Chulho', 'last_name': 'Song'... article Optical enzymatic detection of glucose based o... jour.1357547 Journal of Materials Research NaN 1
10258 [{'id': '2202', 'name': '02 Physical Sciences'... 14 174410 2006 pub.1060619172 17 74 [[{'first_name': 'K. L.', 'last_name': 'Sauer'... article Spin dynamics in the pulsed spin locking of nu... jour.1320488 Physical Review B NaN 1
10290 [{'id': '2447', 'name': '0303 Macromolecular a... 43 21487-96 2006 pub.1056066440 43 110 [[{'first_name': 'Cynthia N', 'last_name': 'Ko... article Triarylphosphine-stabilized platinum nanoparti... jour.1031548 The Journal of Physical Chemistry B NaN 1
10317 [{'id': '2208', 'name': '08 Information and Co... 0 31-32 2006 pub.1047540916 4 10 [[{'first_name': 'Ian D.', 'last_name': 'Chake... article Mobile ad hoc networking and the IETF jour.1139676 ACM SIGMOBILE Mobile Computing and Communicati... NaN 1
17106 [{'id': '2867', 'name': '0906 Electrical and E... 114 1115-1126 2000 pub.1061215656 4 36 [[{'first_name': 'M.', 'last_name': 'Steiner',... article Fast converging adaptive processor or a struct... jour.1022254 IEEE Transactions on Aerospace and Electronic ... NaN 1

1050 rows × 14 columns

[15]:
px.bar(pubs_cited[:1000], color="type",
       x="year", y="patents",
       hover_name="title",  hover_data=["journal.title"],
       title=f"Top Publications from {GRIDID} mentioned in patents, by year of publication")

What are the main subject areas of referenced publications?

[16]:
THRESHOLD_PUBS = 1000
citedids = list(pubs_cited[:THRESHOLD_PUBS]['id'])
pubs_subjects_cited = pubs_subjects[pubs_subjects['id'].isin(citedids)]
[17]:
px.scatter(pubs_subjects_cited, x="year", y="topic", color="type",
           hover_name="for_name",
           height=800,
           marginal_x="histogram", marginal_y="histogram",
           title=f"Top {THRESHOLD_PUBS} {GRIDID} publications cited by patents - by subject area")

Is there a correlation between publication citations and patent citations?

Note: if the points on a scatterplot produce a lower-left-to-upper-right pattern (see below), that is indicative of a positive correlation between the two variables. This pattern means that when the value of one variable is high, we expect the value of the other to be high as well, and vice versa.
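That relationship can also be quantified with Pearson's correlation coefficient r (between -1 and 1); a hand-rolled sketch on toy linear data (in pandas, `pubs['patents'].corr(pubs['times_cited'])` computes the same thing):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

patent_counts = [1, 2, 3, 4, 5]
citation_counts = [10, 20, 30, 40, 50]  # perfectly linear toy data -> r == 1.0
print(round(pearson_r(patent_counts, citation_counts), 6))  # -> 1.0
```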

[18]:
px.scatter(pubs, x="patents", y="times_cited",
           title=f"Citations of {GRIDID} publications from publications VS from patents")

Conclusions

In this Dimensions Analytics API tutorial we have seen how, starting from a GRID organization, it is possible to extract

    1. publications from authors associated with this organization

    2. patents citing those publications (from any organization)

We have also done a basic analysis of the citing-patents dataset, using fields like citation year, assignees, etc.

This only scratches the surface of the possible applications of publication-patents linkage data, but hopefully it’ll give you a few basic tools to get started building your own application!



Note

The Dimensions Analytics API allows you to carry out sophisticated research data analytics tasks like the ones described on this website. Also check out the associated GitHub repository for examples, the source code of these tutorials and much more.